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Abstract 

Background: Simulated nucleotide or amino acid sequences are frequently used to assess the performance of 
phylogenetic reconstruction methods. BEAST, a Bayesian statistical framework that focuses on reconstructing 
time-calibrated molecular evolutionary processes, supports a wide array of evolutionary models, but lacked matching 
machinery for simulation of character evolution along phylogenies. 

Results: We present a flexible Monte Carlo simulation tool, called 7rBUSS, that employs the BEAGLE high 
performance library for phylogenetic computations to rapidly generate large sequence alignments under complex 
evolutionary models. 7iBUSS sports a user-friendly graphical user interface (GUI) that allows combining a rich array of 
models across an arbitrary number of partitions. A command-line interface mirrors the options available through the 
GUI and facilitates scripting in large-scale simulation studies. ttBUSS may serve as an easy-to-use, standard sequence 
simulation tool, but the available models and data types are particularly useful to assess the performance of complex 
BEAST inferences. The connection with BEAST is further strengthened through the use of a common extensible 
markup language (XML), allowing to specify also more advanced evolutionary models. To support simulation under 
the latter, as well as to support simulation and analysis in a single run, we also add the ttBUSS core simulation routine 
to the list of BEAST XML parsers. 

Conclusions: ttBUSS offers a unique combination of flexibility and ease-of-use for sequence simulation under 
realistic evolutionary scenarios. Through different interfaces, ttBUSS supports simulation studies ranging from modest 
endeavors for illustrative purposes to complex and large-scale assessments of evolutionary inference procedures. 
Applications are not restricted to the BEAST framework, or even time-measured evolutionary histories, and ttBUSS can 
be connected to various other programs using standard input and output format. 
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Background 

Recent decades have seen extensive development in phy- 
logenetic inference, resulting in a myriad of techniques, 
each with specific properties concerning evolutionary 
model complexity, inference procedures and performance 
both in terms of speed of execution and estimation 
accuracy. With the development of such phylogenetic 
inference methods comes the need to synthesize evolu- 
tionary data in order to compare estimator performance 

^Correspondence: filip.bielejec@rega.kuleuven.be 

1 Department of Microbiology and Immunology, Rega Institute, KU Leuven, 
Leuven, Belgium 

Full list of author information is available at the end of the article 



and to characterize strengths and weaknesses of different 
approaches (e.g. [1,2]). Whereas the true underlying evo- 
lutionary relationships between biological sequences are 
generally unknown, Monte Carlo simulations allow gener- 
ating test scenarios while controlling for the evolutionary 
history as well as the tempo and mode of evolution. 
This has been frequently used to compare the perfor- 
mance of tree topology estimation (e.g. [3]), but it also 
applies to evolutionary parameter estimation and ances- 
tral reconstruction problems (e.g. [4]). In addition, Monte 
Carlo sequence simulation has proven useful for assess- 
ing model adequacy (e.g. [5]) and for testing competing 
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evolutionary hypotheses (e.g. [6]). It is therefore not sur- 
prising that several general sequence simulation programs 
have been developed (e.g. Seq-Gen [7]), but also inference 
packages that do not primarily focus on tree reconstruc- 
tion, such as PAML [8] and HyPhy [9], maintain code to 
simulate sequence data under the models they implement. 

As a major application of phylogenetics, estimating 
divergence times from molecular sequences requires 
an assumption of roughly constant substitution rates 
throughout evolutionary history [10]. Despite the restric- 
tive nature of this molecular clock assumption, its applica- 
tion in a phylogenetic context has profoundly influenced 
modern views on the timing of many important events 
in evolutionary history [11]. Following a long history of 
applying molecular clock models on fixed tree topolo- 
gies, the Bayesian Evolutionary Analysis by Sampling 
Trees (BEAST) package [12] fully integrates these mod- 
els, including more realistic relaxed clock models [13,14], 
in a phylogenetic inference framework. Despite its pop- 
ularity this framework has lacked a flexible and efficient 
simulation tool. Here, we address this pitfall by intro- 
ducing a parallel BEAST/BEAGLE utility for sequence 
simulation (7rBUSS) that integrates substitution models, 
molecular clock models, tree-generative (coalescent or 
birth-death) models and trait evolutionary models in a 
modular fashion, allowing the user to simulate sequences 
under different parameterizations for each module. 

7TBUSS readily incorporates the temporal dimension of 
evolution through the possibility of specifying different 
molecular clock model. Further, many models and data 
types available for BEAST inference are matched by their 
simulation counter-parts in tt BUSS, including relatively 
specific processes, such as for discrete phylogeography 
with rate matrices that can be sparse or non-reversible 
[15] that are generally beyond the scope of most sequence 
simulation tools. The BE AST- tt BUSS connection is fur- 
ther reinforced by the fact that 7TBUSS can easily gener- 
ate simulation specification in XML format for BEAST. 
Finally, we implement the core simulation routine within 
the BEAST code-base to ensure a shared XML syntax 
between the two packages and to allow for joint simulation 
and inference analysis using a single input file. 

Implementation 

Through different implementations, we support several 
sequence simulation procedures that balance between 
ease-of-use and accessibility, to model complexity. On 
the one hand, the core simulation routine can be per- 
formed following specifications in an XML input file that 
is understood by BEAST (Figure 1A). This procedure 
provides the most comprehensive access to the tt BUSS 
arsenal of models, but may require custom XML edit- 
ing. On the other hand, tt BUSS also represents a stand- 
alone package that conveniently wraps the simulation 



routines in a user-friendly graphical user interface (GUI), 
allowing users to set up and run simulations by load- 
ing input, selecting models from drop-down lists, setting 
their parameter values, and generating output in differ- 
ent formats (Figure IB). To facilitate scripting, 7TBUSS 
is further accessible through a command-line interface 
(CLI), with options that mirror the GUI. The simula- 
tion routines are implemented in Java and interface with 
the Broad-platform Evolutionary Analysis General Likeli- 
hood Evaluator (BEAGLE) high-performance library [16] 
through its application programming interface (API) for 
computationally intensive tasks. 

The core of tt BUSS consists of a recursive tree-traversal 
that is independent of the BEAST inference machinery. 
The algorithm simulates discrete state realizations by vis- 
iting the tree nodes in pre-order fashion, i.e., parental 
nodes are visited before child nodes. When a child node 
is visited, 7rBUSS samples its state from the conditional 
probabilities of changing to state j given state i at the 
parental node. For a branch length t and clock rate r, the 
finite-time transition probability matrix P(rx^) is cal- 
culated through the eigen-decomposition of the infinites- 
imal rate matrix Q along that branch. For a review of 
methods to numerically approximate a matrix exponen- 
tial, we refer to [17]. By sharing the set of XML parsers 
with BEAST, we simplify the simultaneous development 
of both packages and facilitate the ability to perform joint 
simulation and inference analyses. 

Program input 

The core implementation of the software can be invoked 
by loading an XML file with simulation settings in the 
BEAST software. The simulation procedure requires a 
user-specified tree topology or a set of taxa with their 
heights (inversely proportional to their sampling time) for 
which a tree topology can be simulated using a coales- 
cent model. Setting all heights to 0 would be equivalent to 
contemporaneously-sampled taxa. In 7rBUSS, such a tree 
can be loaded in NEXUS or NEWICK format, or a taxa 
list can be set-up in the Data panel for subsequent coa- 
lescent simulation of the genealogy. Creating the latter is 
further facilitated by the ability to load a tab-delimited file 
with a set of taxa and their corresponding heights. The 
input tree or taxon list can also be specified through the 
command-line interface of 7rBUSS. 

Program output 

7rBUSS generates sequence output in FASTA or NEXUS 
format but it also supports XML output of the simula- 
tion settings. The XML provides a notation for the models 
used, it can also be used to store a record of the set- 
tings. Similar to BEAuti for BEAST, 7rBUSS can generate 
an xml template for editing more complex simulations, 
or this can be amended with BEAST analysis settings 
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Figure 1 Overview of the it BUSS simulation procedures and GUI screenshot. A. Schematic representation of the different ways to employ the 
7TBUSS simulation software. Based on an XML input file, simulations can be performed using the core implementation. BEAST can parse the 
specified 7rBUSS instructions and generate sequence data as well as analyze the replicate data in a single run. Using both the GUI or CLI, 7iBUSS can 
run simulations based on an input tree or a list of taxa and their heights. The software can also write the simulation settings to an XML file that can 
be then read by BEAST. B. The screenshot example shows the set-up of a codon position partitioned simulation in the Partitions panel of the 
graphical user interface. The Hasegawa, Kishino and Yano (HKY) model is being set as the substitution model for partition 1 , with a k (the 
transition-transversion bias) parameter value of 4.0. 



order to directly analyze the generated sequence data, 
which avoids writing to an intermediate file. The tuto- 
rial hosted on 7rBUSS webpage provides examples of these 
possibilities. 

Models of evolution 

7rBUSS is capable of generating trees from a list of taxa 
using simple coalescent models, including a constant 
population size or exponential growth model. The soft- 
ware supports simulation of nucleotide, amino acid and 
codon data along the simulated or user-specified phy- 
logeny using standard substitution models. For nucleotide 



data, the Hasegawa, Kishino and Yano model (HKY; [18]), 
the Tamura Nei model (TN93; [19]) and the general 
time-reversible model (GTR; [20]) can be selected from 
a drop-down list, and more restrictive continuous-time 
Markov chain (CTMC) models can be specified by tai- 
loring parameters values. Coding sequences can be sim- 
ulated following the Goldman and Yang model of codon 
evolution (GY94; [21]), which is parameterized in terms 
of a non-synonymous and synonymous substitution rate 
ratio (dN/dS or co) and a transition/transversion rate 
ratio (k) or following the Muse and Gaut model (MG94; 
[22]). Several empirical amino acid substitution models 
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are implemented, including the Dayhoff [23], JTT [24], 
BLOSUM [25], WAG [26] and LG [27] model Equilibrium 
frequencies can be specified for all substitution models as 
well as among- site rate heterogeneity through the widely- 
used discrete-gamma distribution [28] and proportion of 
invariant sites [29] . 

An important feature of 7rBUSS is the ability to set up 
an arbitrary number of partitions for the sequence data 
and associate independent substitution models to them. 
Such settings may reflect codon position-specific evolu- 
tionary patterns or approximate genome architecture with 
separate substitution patterns for coding and non-coding 
regions. Partitions may also be set to evolve along dif- 
ferent phylogenies, which could be used, for example, to 
investigate the impact of recombination or to assess the 
performance of recombination detection programs in spe- 
cific cases. Finally, partitions do not need to share the 
exact same taxa (e.g. reflecting differential taxon sam- 
pling), and in partitions where a particular taxon is not 
represented the relevant sequence will be padded with 
gaps. 

tt BUSS is equipped with the ability to simulate evolu- 
tionary processes on trees calibrated in time units. Under 
the strict clock assumption, this is achieved by spec- 
ifying an evolutionary rate parameter that scales each 
branch from time units into substitution units. 7TBUSS 
also supports branch-specific scalers drawn indepen- 
dently and identically from an underlying distribution (e.g. 
log normal or inverse Gaussian distributions), modeling 
an uncorrelated relaxed clock process [13]. Simulations do 
not need to accommodate an explicit temporal dimension 
and input trees with branch lengths in substitution units 
will maintain these units with the default clock rate of 1 
(substitution/per site/per time unit). 

The data types and models described above are avail- 
able through the tt BUSS GUI or CLI, but additional data 
types and more complex models can be specified directly 
in an XML file. This allows, for example, simulating any 
discrete trait, e.g. representing phylogeographic locations, 
under reversible and nonreversible models [15,30], with 
potentially sparse CTMC matrices [15], as well as sim- 
ulating a combination of sequence data and such traits. 
As an example of available model extensions is the abil- 
ity to specify different CTMC matrices over different time 
intervals of the evolutionary history, allowing for example 
to model changing selective constraints through differ- 
ent codon model parameterizations or seasonal migration 
processes for viral phylogeographic traits [31]. 

Results and discussion 

We have developed a new simulation tool, called n BUSS, 
that we consider to be a rejuvenation of Seq-Gen [7], 
with several extensions to better integrate with the BEAST 
inference framework. Compared to Seq-Gen and other 



simulation software (Table 1), 7rBUSS covers a relatively 
wide range of models while, similar to Mesquite, offer- 
ing a cross-platform, user- friendly GUI. 7TBUSS is imple- 
mented in the Java programming language, and therefore 
requires a Java runtime environment, and depends on 
the BEAGLE library. Although speed is unlikely to 
be an impeding factor in most simulation efforts, the 
core implementation using the BEAGLE library provides 
substantial increases in speed for large-scale simula- 
tions, in particular when invoking multi-core architec- 
ture to produce highly partitioned synthetic sequence 
data. 

Program validation 

We validate 7rBUSS in several ways. First, we compare 
the expected site probabilities, as calculated using tree 
pruning recursion [56], with the observed counts resulting 
from 7rBUSS simulations. To this purpose, we calculate 
the probabilities for all 4 3 possible nucleotide site pat- 
terns observed at the tips of a particular 3-taxon topology 
using an HKY model with a discrete gamma distribution 
to model rate variation among sites. We then compare 
these probabilities to long-run (n = 100,000) site pat- 
tern frequencies simulated under this model and observe 
good correspondence in distribution (Pearsons x 2 test, 
p = 0.42). 

We also perform simulations over larger trees and esti- 
mate substitution parameters (e.g. k in the HKY model) 
using BEAST for a large number of replicates. Not only 
do the posterior mean estimates agree very well with 
the simulated values, but we also find close to nominal 
coverage, and relatively small bias and variance (mean 
squared error). These good performance measures have 
also recently been demonstrated for more complex substi- 
tution processes [31]. 

Example application 

We illustrate the use of simulating sequence data along 
time-calibrated phylogenies to explore the limitations 
of estimating old divergence times for rapidly-evolving 
viruses. Wertheim and Kosakovsky Pond [57] examine 
the evolutionary history of Ebola virus from sequences 
sampled over the span of three decades. Although main- 
taining remarkable amino acid conservation, the authors 
estimate nucleotide substitution rates on the order of 
10 -3 substitutions/per site/per year and a time to most 
recent common ancestor (tMRCA) of about 1,000 years 
ago. These estimates suggest a strong action of purifying 
selection to preserve amino acid residues over longer evo- 
lutionary time scales, which may not be accommodated 
by standard nucleotide substitution models. The authors 
demonstrate that accounting for variable selective pres- 
sure using codon models can result in substantially older 
origins in such cases. 



Table 1 Comparison between a selection of sequence simulation packages 
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1 7i BUSS: GY94, MG94; PhyloSim: GY94 x MO - M4; Recodon: GY94 x MO, Ml, M7, M8; NetRecodon: GY94 x MO, M 1 , M7, M8; Indelible: GY94 x MO - M 1 0; Evolver: GY94 x MO, M 1 , M2, M3, M7, M8; ALF: GY94 x MO, M 1 , M7 and M8; 
GenomePop: MG94. 

2 ttBUSS: BLOSUM [25], CPREV [45], Dayhoff [23], FLU [46], JTT [24], LG [27], MTREV [47], WAG [26]; Seq-Gen: JTT, WAG, 

PAM [48], BLOSUM, MTREV; indel-Seq-Gen2: PAM, JTT, MTREV, CPREV; PhyloSim: CPREV, JTT, LG , MTART [49], MTMAM [50], MTREV24 [51], MTZOA [52], PAM, WAG; Indelible: Dayhoff, JTT, WAG, VT [53], LG, BLOSUM, MTMAM, 
MTREV, MTART, CPREV, RTREV [54], HIVb [55], HIVw [55]; Evolver: Dayhoff, JTT, WAG, MTMAM, MTREV; ProteinEvolver: BLOSUM, CPREV, Dayhoff, HIVb, HIVw, JTT, Jones [24], LG, MTART, MTMAM, MTREV24, RTREV, VT, WAG; 
ALF: PAM,GTT,LG,WAG; SIMPROT: PAM, JTT, PMB. 

3 7iBUSS: demography; Recodon: recombination, migration, demography; NetRecodon: recombination, migration, demography; Mesquite: speciation; ProteinEvolver: recombination, migration, demography; SIMCOAL: 
demography and migration. 

4 PhyloSim: R package; Indelible: Executables for Windows and MacOS; ALF: Web interface; GenomePop: Executables for Windows and Linux; SIMCOAL: Executables for Windows; SIMPROT: GUI, Web interface; SIMPROT: 
Executables for Windows and Linux. 

5 Forward simulation including recombination, demography and migration. 
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Here, we explore the effect of temporally varying 
selection pressure throughout evolutionary history on 
estimates of the tMRCA using nucleotide substitution 
models. In particular, we model a process that is char- 
acterized by increasingly stronger purifying selection as 
we go further back in to time. To this purpose, we set 
up an epoch model' that specifies different GY94 codon 
substitution processes along the evolutionary history [31], 
and parameterize them according to a log-linear relation- 
ship between time and co. Specifically, we let the process 
transition from co = 1.0, 0.2, 0.1, 0.02, 0.01, 0.002, and 
0.001 at time = 10, 50, 100, 500, 1000 and 5000 years in 
the past, respectively. We simulate a constant population 
size genealogy of 50 taxa, sampled evenly during a time 
interval of 25 years, and simulate sequences according to 
the time-heterogeneous codon substitution process with a 
constant clock rate of 3 x 10 -3 codon substitutions/codon 
site/year. We simulate 100 replicates over genealogies with 
varying tMRCAs, by generating topologies under different 
population sizes parameterized by the product of effec- 
tive population size (N e ) and generation time scaled in 
years (r): 

N e x r = 1, 5, 10, 50, 100, 500, 1000 

We note that under this model, trees with tMRCAs of 
about 10,000 years still result in sequences with a notice- 
able degree of homology (resulting in a mean amino acid 
distance of about 0.5, which is in the same range of 
the mean amino acid distance for sequences represen- 
tative of the primate immunodeficiency virus diversity). 



Using a constant co of 0.5 on the other hand results 
in fairly randomized sequences. We subsequently ana- 
lyze the replicate data using a codon position partitioned 
nucleotide substitution model in BEAST and plot the cor- 
respondence between simulated and estimated tMRCAs 
in Figure 2. 

Our simulation exercise shows that a linear relationship 
between simulated and estimated tMRCAs only holds for 
100 to 200 years in the past, and estimates quickly level off 
after about 1000 years in the past. This can be explained 
by the unaccounted decline in amino acid substitutions 
and saturation of the synonymous substitutions as we 
go further back in time. Although we are not claiming 
that evolution occurs quantitatively or even qualitatively 
according to the particular process we simulate under, and 
we ignore other confounding factors (such as potential 
selective constraints on non-neutral synonymous sites), 
this simulation does conceptualize some of the limita- 
tions to estimating ancient origins for rapidly evolving 
viruses that experience strong purifying selection over 
longer evolutionary time scales. 

Conclusion 

7TBUSS provides simulation procedures under many 
evolutionary models or combinations of models avail- 
able in the BEAST framework. This feature facilitates 
the evaluation of estimator performance during the 
development of novel inference techniques and the gen- 
eration of predictive distributions under a wide range 
of evolutionary scenarios that remain critical for testing 
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competing evolutionary hypotheses. Combinations of dif- 
ferent evolutionary models can be accessed through a 
GUI or CLI, and further extensions can be specified in 
XML format with a syntax familiar to the BEAST user 
community. Analogous to the continuing effort to sup- 
port model set-up for BEAST in BEAUti, future releases 
of 7TBUSS aim to provide simulation counterparts to the 
BEAST inference tools, both in terms of data types and 
models, while also maintaining general purposes simula- 
tion capabilities. Interesting targets include discrete traits, 
which can already be simulated through XML specifica- 
tion, continuously-valued phenotype data [58] and indel 
models. Finally, 7TBUSS provides opportunities to pursue 
further computational efficiency through parallelization 
on advancing computing technology. We therefore hope 
that tt BUSS will further stimulate the development of 
sequence and trait evolutionary models and contribute 
to advancement of our knowledge about evolutionary 
processes. 

Availability and requirements 

Project name: 7rBUSS; 

Project home page: www.rega.kuleuven.be/cev/ecv/ 
software/pibuss; 

Operating system(s): Platform independent; 
Programming language: Java; 

Other requirements: Java 1.5 or higher, BEAGLE library; 
License: GNU Lesser GPL; 

Any restrictions to use by non-academics: None. 

Source code of the parallel BEAST/BEAGLE utility for 
sequence simulation is freely available as part of the 
BEAST Google Code repository: www.code.google.com/ 
p/beast-mcmc/. 

The Broad-platform Evolutionary Analysis General 
Likelihood Evaluator (BEAGLE) library has its both 
source code and binary installers available from www. 
code.google.com/p/beagle-lib. 

Scripts and input files required for repeating the simu- 
lation study presented in Example application are hosted 
at www.github.com/phylogeography/DeepRoot. 
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